268 research outputs found

    Speaker adaptation of an acoustic-to-articulatory inversion model using cascaded Gaussian mixture regressions

    No full text
    The article presents a method for adapting a GMM-based acoustic-articulatory inversion model, trained on a reference speaker, to another speaker. The goal is to estimate the articulatory trajectories, in the geometrical space of the reference speaker, from the speech audio signal of another speaker. The method is developed in the context of a visual biofeedback system aimed at pronunciation training: the system provides a speaker with visual information about his/her own articulation, via a 3D orofacial clone. In previous work, we proposed to use GMM-based voice conversion for speaker adaptation. Acoustic-articulatory mapping was achieved in two consecutive steps: 1) converting the spectral trajectories of the target speaker (i.e. the system user) into spectral trajectories of the reference speaker (voice conversion), and 2) estimating the most likely articulatory trajectories of the reference speaker from the converted spectral features (acoustic-articulatory inversion). In this work, we propose to combine these two steps within the same statistical mapping framework, by fusing multiple regressions based on a trajectory GMM and a maximum likelihood estimation (MLE) criterion. The proposed technique is compared to two standard speaker adaptation techniques, based respectively on MAP and MLLR.
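    A minimal sketch of the underlying statistical tool may help here: plain GMM-based regression (MMSE mapping under a joint acoustic-articulatory GMM), on which both the cascaded and the fused approaches build. This is an illustrative Python/scikit-learn reimplementation under simplifying assumptions (static features only, no trajectory/MLE smoothing), not the authors' exact system.

        import numpy as np
        from scipy.special import logsumexp
        from scipy.stats import multivariate_normal
        from sklearn.mixture import GaussianMixture

        def fit_joint_gmm(X, Y, n_components=32, seed=0):
            """Fit a GMM on joint [acoustic; articulatory] feature frames."""
            Z = np.hstack([X, Y])                          # (T, dx + dy)
            return GaussianMixture(n_components, covariance_type="full",
                                   random_state=seed).fit(Z)

        def gmm_inversion(gmm, X, dx):
            """MMSE estimate of articulatory features given acoustic frames X."""
            K = gmm.n_components
            mu_x, mu_y = gmm.means_[:, :dx], gmm.means_[:, dx:]
            S_xx = gmm.covariances_[:, :dx, :dx]
            S_yx = gmm.covariances_[:, dx:, :dx]
            # Posterior p(k | x) under the marginal acoustic GMM.
            log_p = np.stack([multivariate_normal.logpdf(X, mu_x[k], S_xx[k])
                              for k in range(K)], axis=1) + np.log(gmm.weights_)
            post = np.exp(log_p - logsumexp(log_p, axis=1, keepdims=True))
            # Mixture of per-component linear regressions E[y | x, k].
            Y_hat = np.zeros((X.shape[0], mu_y.shape[1]))
            for k in range(K):
                A = S_yx[k] @ np.linalg.inv(S_xx[k])
                Y_hat += post[:, [k]] * (mu_y[k] + (X - mu_x[k]) @ A.T)
            return Y_hat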

    Audio-Visual Speaker Conversion using Prosody Features

    Get PDF
    The article presents a joint audio-visual approach to speaker identity conversion, based on statistical methods originally introduced for voice conversion. Using data from the BIWI 3D Audiovisual Corpus of Affective Communication, mapping functions are built between pairs of speakers in order to convert speaker-specific features: the speech signal and 3D facial expressions. The results obtained by combining audio and visual features are compared subjectively to corresponding results from earlier approaches, outlining the improvements brought by introducing dynamic features and exploiting prosodic features.
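    In the same joint-GMM framework, conversion only changes what the paired vectors contain. The following hypothetical sketch (reusing fit_joint_gmm/gmm_inversion from the listing above) shows how audio and visual streams could be stacked, with delta features since the abstract stresses dynamics; the alignment step and feature names are assumptions, not the paper's pipeline.

        import numpy as np

        def build_av_features(spec, markers, add_deltas=True):
            """Stack spectral and 3D face-marker streams for one utterance."""
            feats = np.hstack([spec, markers])     # (T, d_audio + d_visual)
            if add_deltas:
                # Delta features capture the dynamics highlighted above.
                feats = np.hstack([feats, np.gradient(feats, axis=0)])
            return feats

        # src_av, tgt_av: time-aligned (T, d) features of the source and target
        # speakers (e.g. aligned with dynamic time warping on the audio); then:
        #   gmm = fit_joint_gmm(src_av, tgt_av)
        #   converted = gmm_inversion(gmm, src_av, dx=src_av.shape[1])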

    What the Future Brings: Investigating the Impact of Lookahead for Incremental Neural TTS

    Full text link
    In incremental text-to-speech synthesis (iTTS), the synthesizer produces an audio output before it has access to the entire input sentence. In this paper, we study the behavior of a neural sequence-to-sequence TTS system used in an incremental mode, i.e. when generating the speech output for token n, the system only has access to n + k tokens from the text sequence. We first analyze the impact of this incremental policy on the evolution of the encoder representations of token n for different values of k (the lookahead parameter). The results show that, on average, tokens travel 88% of the way to their full-context representation with a one-word lookahead and 94% after two words. We then investigate which text features are the most influential on the evolution towards the final representation, using a random forest analysis. The results show that the most salient factors are related to token length. We finally evaluate the effect of the lookahead k at the decoder level, using a MUSHRA listening test. This test shows results that contrast with the above high figures: the speech synthesis quality obtained with a two-word lookahead is significantly lower than that obtained with the full sentence.
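    The encoder-side analysis can be made concrete with a short sketch of the "fraction of the way travelled" measure described above, assuming a placeholder function encode(tokens) returning one encoder state per token (any sequence-to-sequence TTS encoder would do; "encode" is not the paper's actual API).

        import numpy as np

        def progress_towards_full_context(tokens, n, k, encode):
            """Fraction of the way token n's representation has travelled
            towards its full-sentence representation with k-token lookahead."""
            h_first = encode(tokens[:n + 1])[n]        # token n with no lookahead
            h_part = encode(tokens[:n + 1 + k])[n]     # n + k tokens visible
            h_full = encode(tokens)[n]                 # whole sentence visible
            total = np.linalg.norm(h_full - h_first)
            if total == 0:
                return 1.0                             # representation never moved
            travelled = total - np.linalg.norm(h_full - h_part)
            return travelled / total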

    Notes on the use of variational autoencoders for speech and audio spectrogram modeling

    Get PDF
    Variational autoencoders (VAEs) are powerful (deep) generative artificial neural networks. They have recently been used in several papers on speech and audio processing, in particular for the modeling of speech/audio spectrograms. In these papers, very little theoretical support is given to justify the chosen data representation and decoder likelihood function, or the corresponding cost function used for training the VAE. Yet a solid theoretical statistical framework exists, and has been extensively presented and discussed in papers dealing with nonnegative matrix factorization (NMF) of audio spectrograms and its application to audio source separation. In the present paper, we show how this statistical framework applies to VAE-based speech/audio spectrogram modeling, providing insights into the choice and interpretability of the data representation and model parameterization.
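    One instance of that statistical framework can be sketched in code: if the decoder outputs the variance of a zero-mean complex Gaussian model of the STFT coefficients, the reconstruction term of the VAE cost reduces to the Itakura-Saito divergence familiar from NMF-based audio modeling. The PyTorch sketch below is a minimal illustration under that assumption; the architecture sizes and names are illustrative, not the paper's.

        import torch
        import torch.nn as nn

        class SpectrogramVAE(nn.Module):
            def __init__(self, n_freq=513, latent=16):
                super().__init__()
                self.enc = nn.Sequential(nn.Linear(n_freq, 128), nn.Tanh())
                self.mu = nn.Linear(128, latent)
                self.logvar = nn.Linear(128, latent)
                self.dec = nn.Sequential(nn.Linear(latent, 128), nn.Tanh(),
                                         nn.Linear(128, n_freq))

            def forward(self, s):          # s: power-spectrogram frames (B, n_freq)
                h = self.enc(s)
                mu, logvar = self.mu(h), self.logvar(h)
                z = mu + torch.randn_like(mu) * (0.5 * logvar).exp()
                v = self.dec(z).exp()      # positive variances per frequency bin
                return v, mu, logvar

        def is_nll(s, v, eps=1e-8):
            """Itakura-Saito negative log-likelihood (up to constants)."""
            r = (s + eps) / (v + eps)
            return (r - torch.log(r) - 1).sum(dim=-1).mean()

        def kl_term(mu, logvar):
            """Standard Gaussian KL regularizer of the VAE."""
            return -0.5 * (1 + logvar - mu.pow(2) - logvar.exp()).sum(dim=-1).mean()

        # Training cost for a batch s: is_nll(s, v) + kl_term(mu, logvar),
        # with v, mu, logvar = model(s).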

    Adaptive Latency for Part-of-Speech Tagging in Incremental Text-to-Speech Synthesis

    No full text
    Incremental text-to-speech systems aim at synthesizing text 'on the fly', while the user is typing a sentence. In this context, this article addresses the problem of part-of-speech (POS, i.e. lexical category) tagging, a critical step for accurate grapheme-to-phoneme conversion and prosody estimation. Here, the main challenge is to estimate the POS of a given word without knowing its 'right context' (i.e. the following words, which are not yet available). To address this issue, we propose a method based on a set of decision trees estimating online whether a given POS tag is likely to be modified when more right-contextual information becomes available. In such a case, the synthesis is delayed until POS stability is guaranteed. This results in delivering the synthetic voice in word chunks of variable length. An objective evaluation on French shows that the proposed method estimates POS tags with more than 92% accuracy (compared to a non-incremental system) while minimizing the synthesis latency (between 1 and 4 words). A perceptual evaluation (ranking test) is then carried out in the context of HMM-based speech synthesis. Experimental results show that the word grouping resulting from the proposed method is rated as more acceptable than word-by-word incremental synthesis.
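    The buffering policy described above can be sketched as follows, assuming a tagger pos_tag(words) returning one tag per word from left context only, and a predicate is_stable(word, tag) standing in for the pre-trained decision trees; both are placeholders, not the authors' implementation.

        def incremental_chunks(words, pos_tag, is_stable, max_latency=4):
            """Yield variable-length word chunks whose POS tags are deemed stable."""
            buffer = []
            for i, word in enumerate(words):
                buffer.append(word)
                tags = pos_tag(words[:i + 1])      # tags given left context only
                # Flush once every buffered tag is predicted stable, or when
                # the latency bound (in words) is reached.
                stable = all(is_stable(w, t)
                             for w, t in zip(buffer, tags[-len(buffer):]))
                if stable or len(buffer) >= max_latency:
                    yield buffer                   # synthesize this chunk now
                    buffer = []
            if buffer:
                yield buffer                       # flush the tail at sentence end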

    Is it really always only the others who are to blame? GP’s view on medical overuse. A questionnaire study

    Get PDF
    Background: Medical overuse is a common problem in health care. Preventing unnecessary medicine is one of the main tasks of general practice, so-called quaternary prevention. We aimed to capture the current opinion of German General Practitioners (GPs) on medical overuse.
    Methods: A quantitative online study was conducted. The questionnaire was developed based on a qualitative study and a literature search. GPs were asked to estimate the prevalence of medical overuse and to evaluate its drivers and possible solutions. GPs in Bavaria were recruited via email (750 addresses). A descriptive data analysis was performed. Additionally, the association between doctors' attitudes and (1) demographic variables and (2) interest in campaigns against medical overuse was assessed.
    Results: The response rate was 18%. The mean age was 54 years, 79% were male, and 68% had worked as a GP for longer than 15 years. Around 38% of medical services were considered medical overuse, and nearly half of the GPs (47%) judged medical overuse to be a more important problem than medical underuse. The main drivers were seen in "patients' expectations" (76%), "lack of a primary care system" (61%) and "defensive medicine" (53%), whereas "disregard of evidence/guidelines" (15%) and "economic pressure on the side of the doctor" (13%) were not weighted as important causes. Demographic variables did not have an important impact on GPs' response patterns. GPs interested in campaigns like "Choosing Wisely" showed a higher awareness of medical overuse, although these campaigns were known by only 50% of the respondents.
    Discussion: Medical overuse is an important issue for GPs. The main drivers were sought, and found, outside their own sphere of responsibility. Campaigns such as "Choosing Wisely" seem to have a positive effect on GPs' attitudes, but knowledge of them is still limited.

    Repeat after me: Self-supervised learning of acoustic-to-articulatory mapping by vocal imitation

    Full text link
    We propose a computational model of speech production combining: a pre-trained neural articulatory synthesizer able to reproduce complex speech stimuli from a limited set of interpretable articulatory parameters; a DNN-based internal forward model predicting the sensory consequences of articulatory commands; and an internal inverse model, based on a recurrent neural network, recovering articulatory commands from the acoustic speech input. Both the forward and inverse models are jointly trained in a self-supervised way from raw acoustic-only speech data from different speakers. The imitation simulations are evaluated objectively and subjectively and display quite encouraging performance.
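    A minimal PyTorch sketch of such a self-supervised imitation loop is given below, assuming a frozen pre-trained articulatory synthesizer synth(art) mapping articulatory commands to acoustic frames. Module sizes, names, and the exact loss combination are illustrative assumptions, not the authors' architecture.

        import torch
        import torch.nn as nn

        n_art, n_ac = 10, 80                       # e.g. articulatory params, mel bins
        forward_model = nn.Sequential(nn.Linear(n_art, 256), nn.ReLU(),
                                      nn.Linear(256, n_ac))
        inverse_model = nn.GRU(n_ac, n_art, batch_first=True)
        opt = torch.optim.Adam([*forward_model.parameters(),
                                *inverse_model.parameters()], lr=1e-4)

        def train_step(speech, synth):
            """speech: (B, T, n_ac) acoustic frames from any speaker."""
            art, _ = inverse_model(speech)         # recovered articulatory commands
            with torch.no_grad():
                produced = synth(art)              # frozen synthesizer, no gradients
            # Forward model learns the synthesizer's behavior on current commands.
            loss_fwd = nn.functional.mse_loss(forward_model(art.detach()), produced)
            # Inverse model learns commands whose predicted consequences match the
            # heard speech (in this simplified sketch, this term also updates the
            # forward model).
            loss_inv = nn.functional.mse_loss(forward_model(art), speech)
            loss = loss_fwd + loss_inv
            opt.zero_grad()
            loss.backward()
            opt.step()
            return loss.item()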